Snorkel: Rapid Training Data Creation with Weak Supervision

نویسندگان

  • Alexander Ratner
  • Stephen H. Bach
  • Henry R. Ehrenberg
  • Jason Alan Fries
  • Sen Wu
  • Christopher Ré
چکیده

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train stateof-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets. PVLDB Reference Format: A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré. Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB, 11 (3): xxxx-yyyy, 2017. DOI: 10.14778/3157794.3157797

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Snorkel: Beyond Hand-labeled Data

This talk describes Snorkel, a software system whose goal is to make routine machine learning tasks dramatically easier. Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets for a user’s task. In Snorkel, a user implicitly defines large training sets by writing simple programs that create labeled data, instead of tediously hand-...

متن کامل

A Brief Introduction to Weakly Supervised Learning

Supervised learning techniques construct predictive models by learning from a large number of training examples, where each training example has a label indicating its ground-truth output. Though current techniques have achieved great success, it is noteworthy that in many tasks it is difficult to get strong supervision information like fully ground-truth labels due to the high cost of data lab...

متن کامل

Snorkel: A System for Lightweight Extraction

We describe a vision and an initial prototype system for extracting structured data from unstructured or dark input sources–such as text, embedded tables, images, and diagrams–called Snorkel, in which users write traditional extraction scripts which are automatically enhanced by machine learning techniques. The key technical idea is to view the user’s actions with standard tools as implicitly d...

متن کامل

Socratic Learning: Correcting Misspecified Generative Models using Discriminative Models

A challenge in training discriminative models like neural networks is obtaining enough labeled training data. Recent approaches use generative models to combine weak supervision sources, like user-defined heuristics or knowledge bases, to label training data. Prior work has explored learning accuracies for these sources even without ground truth labels, but they assume that a single accuracy pa...

متن کامل

Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

Information extraction (IE) holds the promise of generating a large-scale knowledge base from the Web’s natural language text. Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors. Recently, researchers have developed multiinstance lear...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • PVLDB

دوره 11  شماره 

صفحات  -

تاریخ انتشار 2017